A hospital readmission occurs when a patient who has been discharged from the hospital is admitted again within a certain period of time. Hospital readmission rates are now considered an indicator of hospital quality, and readmissions contribute to the high cost of healthcare in America. To address this issue, the Centers for Medicare & Medicaid Services established the Hospital Readmissions Reduction Program (HRRP) with the goal of improving the quality of care for patients and reducing healthcare spending. The primary way the program does this is by applying payment penalties to hospitals with higher-than-expected readmission rates for certain conditions. While diabetes has not yet been added to HRRP's list of conditions, diabetes is the [condition with the third most all-cause, 30-day readmissions for Medicaid patients, and in 2011 American hospitals spent over $41 billion on diabetic patients who were readmitted within 30 days of discharge](https://www.hcup-us.ahrq.gov/reports/statbriefs/sb172-Conditions-Readmissions-Payer.jsp).
The ability to determine which risk factors lead to higher readmission in such patients, and accordingly to predict which patients will be readmitted, could help save hospitals billions of dollars while also improving the quality of care. With this goal in mind, we set out to answer two questions:
1. How well can we predict 30-day hospital readmission of diabetes patients using medical claims data?
2. Which factors are the most important in predicting hospital readmission of a diabetic patient?
The dataset we will be working with represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria.
1. It is an inpatient encounter (a hospital admission).
2. It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the system as a diagnosis.
3. The length of stay was at least 1 day and at most 14 days.
4. Laboratory tests were performed during the encounter.
5. Medications were administered during the encounter.
The data contains attributes such as patient number, race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test result, diagnosis, number of medications, diabetic medications, and number of outpatient, inpatient, and emergency visits in the year before the hospitalization.
The dataset is publicly available from the UCI Machine Learning Repository.
To get us started, we read in the dataset and keep only the first encounter for patients who appear more than once, in order to avoid the bias that would come with the overrepresentation of certain patients. This reduced the dataset to approximately 70,000 encounters:
# Reading in data
data <- read.csv("dataset/diabetic_data.csv")
# remove duplicates
data <- data[!duplicated(data$patient_nbr), ]
Before moving on to do some data cleaning, let’s observe the descriptive statistics of each column in the training dataset.
The skimr package provides a nice solution to show key descriptive stats for each column.
data[data == "?"] <- NA
skimmed <- skim_to_wide(data)
knitr::kable(skimmed[, c(1:3, 5:6, 9:10, 16)])
| type | variable | missing | n | n_unique | mean | sd | hist |
|---|---|---|---|---|---|---|---|
| factor | A1Cresult | 0 | 71518 | 4 | NA | NA | NA |
| factor | acarbose | 0 | 71518 | 3 | NA | NA | NA |
| factor | acetohexamide | 0 | 71518 | 2 | NA | NA | NA |
| factor | age | 0 | 71518 | 10 | NA | NA | NA |
| factor | change | 0 | 71518 | 2 | NA | NA | NA |
| factor | chlorpropamide | 0 | 71518 | 4 | NA | NA | NA |
| factor | citoglipton | 0 | 71518 | 1 | NA | NA | NA |
| factor | diabetesMed | 0 | 71518 | 2 | NA | NA | NA |
| factor | diag_1 | 11 | 71518 | 696 | NA | NA | NA |
| factor | diag_2 | 294 | 71518 | 725 | NA | NA | NA |
| factor | diag_3 | 1225 | 71518 | 758 | NA | NA | NA |
| factor | examide | 0 | 71518 | 1 | NA | NA | NA |
| factor | gender | 0 | 71518 | 3 | NA | NA | NA |
| factor | glimepiride | 0 | 71518 | 4 | NA | NA | NA |
| factor | glimepiride.pioglitazone | 0 | 71518 | 1 | NA | NA | NA |
| factor | glipizide | 0 | 71518 | 4 | NA | NA | NA |
| factor | glipizide.metformin | 0 | 71518 | 2 | NA | NA | NA |
| factor | glyburide | 0 | 71518 | 4 | NA | NA | NA |
| factor | glyburide.metformin | 0 | 71518 | 4 | NA | NA | NA |
| factor | insulin | 0 | 71518 | 4 | NA | NA | NA |
| factor | max_glu_serum | 0 | 71518 | 4 | NA | NA | NA |
| factor | medical_specialty | 34477 | 71518 | 70 | NA | NA | NA |
| factor | metformin | 0 | 71518 | 4 | NA | NA | NA |
| factor | metformin.pioglitazone | 0 | 71518 | 2 | NA | NA | NA |
| factor | metformin.rosiglitazone | 0 | 71518 | 2 | NA | NA | NA |
| factor | miglitol | 0 | 71518 | 4 | NA | NA | NA |
| factor | nateglinide | 0 | 71518 | 4 | NA | NA | NA |
| factor | payer_code | 31043 | 71518 | 17 | NA | NA | NA |
| factor | pioglitazone | 0 | 71518 | 4 | NA | NA | NA |
| factor | race | 1948 | 71518 | 5 | NA | NA | NA |
| factor | readmitted | 0 | 71518 | 3 | NA | NA | NA |
| factor | repaglinide | 0 | 71518 | 4 | NA | NA | NA |
| factor | rosiglitazone | 0 | 71518 | 4 | NA | NA | NA |
| factor | tolazamide | 0 | 71518 | 2 | NA | NA | NA |
| factor | tolbutamide | 0 | 71518 | 2 | NA | NA | NA |
| factor | troglitazone | 0 | 71518 | 2 | NA | NA | NA |
| factor | weight | 68665 | 71518 | 9 | NA | NA | NA |
| integer | admission_source_id | 0 | 71518 | NA | 5.66 | 4.16 | ▅▇▁▁▁▁▁▁ |
| integer | admission_type_id | 0 | 71518 | NA | 2.1 | 1.51 | ▇▃▃▁▁▁▁▁ |
| integer | discharge_disposition_id | 0 | 71518 | NA | 3.59 | 5.27 | ▇▂▁▁▁▁▁▁ |
| integer | encounter_id | 0 | 71518 | NA | 1.6e+08 | 1e+08 | ▅▇▇▅▃▂▁▁ |
| integer | num_lab_procedures | 0 | 71518 | NA | 43.08 | 19.95 | ▃▃▇▆▂▁▁▁ |
| integer | num_medications | 0 | 71518 | NA | 15.71 | 8.31 | ▆▇▂▁▁▁▁▁ |
| integer | num_procedures | 0 | 71518 | NA | 1.43 | 1.76 | ▇▃▂▂▁▁▁▁ |
| integer | number_diagnoses | 0 | 71518 | NA | 7.25 | 1.99 | ▁▂▅▃▇▁▁▁ |
| integer | number_emergency | 0 | 71518 | NA | 0.1 | 0.51 | ▇▁▁▁▁▁▁▁ |
| integer | number_inpatient | 0 | 71518 | NA | 0.18 | 0.6 | ▇▁▁▁▁▁▁▁ |
| integer | number_outpatient | 0 | 71518 | NA | 0.28 | 1.07 | ▇▁▁▁▁▁▁▁ |
| integer | patient_nbr | 0 | 71518 | NA | 5.5e+07 | 3.9e+07 | ▇▇▅▆▅▁▁▁ |
| integer | time_in_hospital | 0 | 71518 | NA | 4.29 | 2.95 | ▇▇▂▃▂▁▁▁ |
Now, we will drop some of the columns for various reasons:

- Encounter ID: provides no information for predicting readmission
- Patient Number: provides no information for predicting readmission
- Weight: while potentially useful, there are too many missing values
- Payer Code: too many missing values
- Medical Specialty: probably unnecessary, and too many missing values
- Diabetes Medication: redundant information
- diag_2 and diag_3: ICD codes are hard to deal with and often lack predictive power, so we will use only the primary diagnosis
drops <- c("encounter_id", "patient_nbr", "weight", "payer_code",
"medical_specialty", "diabetesMed", "diag_2", "diag_3")
data <- data[ , !(names(data) %in% drops)]
Next, we need to drop any patients whose discharge status is listed as “expired”. If a patient has died, there is no chance of readmission.
data <- filter(data, !(discharge_disposition_id %in% c(11, 19, 20, 21)))
We will also remove the 3 subjects with an unknown gender.
data <- data %>% filter(gender != "Unknown/Invalid")
data$gender <- droplevels(data$gender)
data$gender <- as.integer(data$gender) - 1
# Encoding ages
data$age <- ifelse(data$age == "[0-10)", 5,
ifelse(data$age == "[10-20)", 15,
ifelse(data$age == "[20-30)", 25,
ifelse(data$age == "[30-40)", 35,
ifelse(data$age == "[40-50)", 45,
ifelse(data$age == "[50-60)", 55,
ifelse(data$age == "[60-70)", 65,
ifelse(data$age == "[70-80)", 75,
ifelse(data$age == "[80-90)", 85, 95)))))))))
data$max_glu_serum <- ifelse(data$max_glu_serum == "None", 0,
ifelse(data$max_glu_serum == "Norm", 1, 2))
data$A1Cresult <- ifelse(data$A1Cresult == "None", 0,
ifelse(data$A1Cresult == "Norm", 1,
ifelse(data$A1Cresult %in% c(">7", ">8"), 2, NA)))
data$insulin <- ifelse(data$insulin == "No", 0, ifelse(data$insulin == "Down", 1, ifelse(data$insulin == "Steady", 2, 3)))
data$change <- ifelse(data$change == "Ch", 1, 0)
data$race <- ifelse(data$race == "Caucasian", 0, ifelse(data$race == "AfricanAmerican", 1, 2))
data$race[is.na(data$race)] <- 2
data$readmitted <- ifelse((data$readmitted == "NO" | data$readmitted == ">30"), 0, 1)
data %>% group_by(readmitted) %>% summarize(count=n())
## # A tibble: 2 x 2
## readmitted count
## <dbl> <int>
## 1 0 64138
## 2 1 6293
We see that we have a highly imbalanced class problem.
# Sum of number of outpatient, emergency and inpatient encounters
data$healthcare_utilization <- (data$number_outpatient + data$number_emergency + data$number_inpatient)
# Each diabetes medication column takes values among "No", "Up", "Steady", "Down"
# (not every column has all four levels, but the same recoding works for all of them).
# Recode to 1 if the drug was prescribed ("Up", "Steady", or "Down"), 0 otherwise.
med_cols <- c("metformin", "repaglinide", "nateglinide", "chlorpropamide", "glimepiride",
              "acetohexamide", "glipizide", "glyburide", "tolbutamide", "pioglitazone",
              "rosiglitazone", "acarbose", "miglitol", "troglitazone", "tolazamide",
              "examide", "citoglipton", "glyburide.metformin", "glipizide.metformin",
              "glimepiride.pioglitazone", "metformin.rosiglitazone", "metformin.pioglitazone")
data[med_cols] <- lapply(data[med_cols], function(x) as.integer(x %in% c("Up", "Steady", "Down")))
# Count how many of these medications were prescribed, then drop the individual columns
data$num_other_diabetes_meds_up_stdy_dwn <- rowSums(data[med_cols])
data <- data[ , !(names(data) %in% med_cols)]
# set up dummy variables for admission types
data <- mutate(data, at_emergent = as.numeric(admission_type_id %in% c(1, 2, 7)),
at_elective = as.numeric(admission_type_id == 3),
at_other = as.numeric(admission_type_id %in% c(4, 5, 6, 8)))
# Set up dummy variables for discharge dispositions
data <- mutate(data, dd_home = as.numeric(discharge_disposition_id %in% c(1, 6, 8)),
dd_facility_transfer = as.numeric(discharge_disposition_id %in% c(2, 3, 4, 5, 10, 16, 17, 22, 23, 24, 30, 27, 28, 29, 13, 14)),
dd_other = as.numeric(discharge_disposition_id %in% c(7, 18, 25, 26, 9, 12, 15)))
# dd_admitted = as.numeric(discharge_disposition_id %in% c(9, 12, 15)), #lumping admitted into other
# add_expired = as.numeric(discharge_disposition_id %in% c(11, 19, 20, 21)), (dropped patients who expired)
# dd_hospice = as.numeric(discharge_disposition_id %in% c(13, 14))) #lumping hospice into transfer
# Set up dummy variables for admission source
data <- mutate(data, as_outpatient = as.numeric(admission_source_id %in% c(1, 2, 3)),
as_facility_transfer = as.numeric(admission_source_id %in% c(4, 5, 6, 10, 18, 19, 22, 25, 26)),
as_ed = as.numeric(admission_source_id %in% c(7)),
as_other = as.numeric(admission_source_id %in% c(8, 9, 15, 17, 20, 21, 11, 12, 13, 14, 23, 24)))
#as_newborn = as.numeric(admission_source_id %in% c(11, 12, 13, 14, 23, 24)))#as_newborn added to as_other
data <- within(data, rm(admission_type_id))
data <- within(data, rm(discharge_disposition_id))
data <- within(data, rm(admission_source_id))
data$diag_1 <- as.character(data$diag_1)
data$diag_1grp <- ifelse (data$diag_1 == "?", "Unknown",
ifelse(grepl(data$diag_1, pattern = "[EV]") == T, "External",
ifelse(floor(as.numeric(data$diag_1)) == 250, "Diabetes",
ifelse(data$diag_1 %in% c(390:459, 785), "Circulatory",
ifelse(data$diag_1 %in% c(460:519, 786), "Respiratory",
ifelse(data$diag_1 %in% c(520:579, 787), "Digestive",
ifelse(data$diag_1 %in% c(800:999), "Injury",
ifelse(data$diag_1 %in% c(710:739), "Musculoskeletal",
ifelse(data$diag_1 %in% c(580:629, 788), "Genitourinary",
ifelse(data$diag_1 %in% c(140:239), "Neoplasm",
ifelse(data$diag_1 %in% c(780, 781, 784, 790:799, 740:759), "Other", # added congenital to other
ifelse(data$diag_1 %in% c(240:249, 251:279), "Endocrine_Nutrition_Metabolic",
ifelse(data$diag_1 %in% c(680:709, 782), "Skin",
ifelse(data$diag_1 %in% c(001:139), "Infectious",
ifelse(data$diag_1 %in% c(290:319), "Mental",
ifelse(data$diag_1 %in% c(280:289), "Blood",
ifelse(data$diag_1 %in% c(320:359), "Nervous",
ifelse(data$diag_1 %in% c(630:679), "Pregnancy",
ifelse(data$diag_1 %in% c(360:389), "Sense", "Unknown")
#ifelse(data$diag_1 %in% c(740:759), "Congenital", "NULL")
))))))))))))))))))
## Warning in ifelse(floor(as.numeric(data$diag_1)) == 250, "Circulatory", :
## NAs introduced by coercion
data$diag_1grp <- as.factor(data$diag_1grp)
data <- one_hot(data.table(data))
data <- as.data.frame(data) # convert back to data.frame format
data <- within(data, rm(diag_1))
data <- na.omit(data)
Before building any models, we will try some dimensionality reduction techniques to see if there is any separation between our two classes. We start with principal component analysis (PCA), which uses an orthogonal transformation to convert the original variables into a set of linear combinations of them, called principal components. PCA is commonly used as a tool in exploratory data analysis.
We also use the popular T-distributed Stochastic Neighbor Embedding (t-SNE) algorithm, which is a machine learning visualization tool. It is a non-linear dimensionality reduction technique that models each high-dimensional observation by a two-dimensional point in such a way that similar observations are modeled by nearby points with high probability.
pca_tsne_df <- data[,c(1:41)]
# scale variables
for(i in 1:ncol(pca_tsne_df)){
pca_tsne_df[[i]] <- ((pca_tsne_df[[i]] - mean(pca_tsne_df[[i]])) / sd(pca_tsne_df[[i]]))
}
pca_tsne_df <- t(pca_tsne_df)
#PCA
PC <- prcomp(pca_tsne_df)
var_explained <- data.frame("PC" = c(1:41), "eigenvalue" = (PC$sdev)^2)
var_explained$var_explained <- var_explained$eigenvalue / sum(var_explained$eigenvalue)
var_explained$cumulative <- cumsum(var_explained$var_explained) / sum(var_explained$var_explained)
# ggplot(var_explained, aes(x = PC, y = var_explained)) +
# geom_line() +
# labs(title = "Elbow plot: proportion of variance explained by each PC")
#
# ggplot(var_explained, aes(x = PC, y = cumulative)) +
# geom_line() +
# labs(title = "Cumulative variance explained by PC_1 through PC_n",
# x = "n", y = "var_explained")
p1 <- ggplot(var_explained, aes(x = PC, y = var_explained)) +
geom_line() +
labs(title = "Elbow plot: proportion of variance \nexplained by each PC") +
theme(panel.grid.minor = element_blank())
p2 <- ggplot(var_explained, aes(x = PC, y = cumulative)) +
geom_line() +
labs(title = "Cumulative proportion of variance \nexplained by PC_1 through PC_n",
x = "PC", y = "variance explained") +
theme(axis.title.y = element_blank(), panel.grid.minor = element_blank())
ggsave(grid.arrange(p1, p2, ncol = 2), filename = "docs/images/pcastats_grid.png", width = 9, height = 4)
outdf <- cbind(data, PC$rotation)
outdf$readmitted <- as.factor(outdf$readmitted)
ggplot(outdf, aes(x = PC1, y = PC2)) +
geom_point(aes(color = readmitted), alpha = 0.5) +
scale_color_manual(values = c("dodgerblue", "firebrick1")) +
labs(title = "Classes are not well-separated by principal components")
# TSNE
tsne_model_1 = Rtsne(t(as.matrix(pca_tsne_df)), check_duplicates = FALSE, pca = TRUE, perplexity = 10, theta = 0.5, dims = 2)
## getting the two dimension matrix
d_tsne_1 = as.data.frame(tsne_model_1$Y)
d_tsne_1$readmitted <- as.factor(data$readmitted)
## plotting the results without clustering
ggplot(d_tsne_1, aes(x=V1, y=V2)) +
geom_point(aes(color = readmitted)) +
scale_color_manual(values = c("dodgerblue", "firebrick1")) +
labs(title = "Classes are not well-separated by t-SNE")
We do not see any class separation of readmitted classes by PCA or t-SNE analysis.
# script for making grid of plots for website
p1 <- ggplot(data=data, aes(x=factor(readmitted, labels = c("No Readmission \nwithin 30 Days", "Readmission \nin < 30 Days")), y=num_lab_procedures)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="Number of Lab Procedures", x="Readmission Status", title="Number of Lab Procedures", fill="Readmission Status") +
theme(axis.title.x = element_blank(), plot.margin = margin(0.5,0.5,0.5,0.5,"cm"))
p2 <- ggplot(data=data, aes(x=factor(readmitted, labels = c("No Readmission \nwithin 30 Days", "Readmission \nin < 30 Days")), y=num_procedures)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="Number of Non-Lab Procedures", x="Readmission Status", title="Number of Non-Lab Procedures", fill="Readmission Status") +
theme(axis.title.x = element_blank(), plot.margin = margin(0.5,0.5,0.5,0.5,"cm"))
p3 <- ggplot(data=data, aes(x=factor(readmitted, labels = c("No Readmission \nwithin 30 Days", "Readmission \nin < 30 Days")), y=num_medications)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="Number of Medications Administered", x="Readmission Status", title="Number of Medications Administered", fill="Readmission Status") +
theme(axis.title.x = element_blank(), plot.margin = margin(0.5,0.5,0.5,0.5,"cm"))
p4 <- ggplot(data=data, aes(x=factor(readmitted, labels = c("No Readmission \nwithin 30 Days", "Readmission \nin < 30 Days")), y=healthcare_utilization)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="Healthcare Utilization", x="Readmission Status", title="Healthcare Utilization", fill="Readmission Status") +
theme(axis.title.x = element_blank(), plot.margin = margin(0.5,0.5,0.5,0.5,"cm"))
p5 <- ggplot(data=data, aes(x=factor(readmitted, labels = c("No Readmission \nwithin 30 Days", "Readmission \nin < 30 Days")), y=num_other_diabetes_meds_up_stdy_dwn)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="Number of Diabetes Medications Changed", x="Readmission Status", title="Number of Diabetes Medications Changed", fill="Readmission Status") +
theme(axis.title.x = element_blank(), plot.margin = margin(0.5,0.5,0.5,0.5,"cm"))
p6 <- ggplot(data=data, aes(x=factor(readmitted, labels = c("No Readmission \nwithin 30 Days", "Readmission \nin < 30 Days")), y=number_diagnoses)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="# of Diagnoses", x="Readmission Status", title="Number of Diagnoses", fill="Readmission Status") +
theme(axis.title.x = element_blank(), plot.margin = margin(0.5,0.5,0.5,0.5,"cm"))
ggsave(grid.arrange(p1, p2, p3, p4, p5, p6, ncol=2), filename = "docs/images/feature_grid.png", width = 8.5, height = 11)
ggplot(data=data, aes(x=factor(race, labels = c("Caucasian", "African American", "Other")),
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_bar(position=position_dodge(), color = "black") +
labs(title="Categorywise Bar Chart of Race", x="Race", y="Count", fill="Readmission Status")
ggplot(data=data, aes(x=factor(gender, labels = c("Female", "Male")),
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_bar(position=position_dodge(), color = "black") +
labs(title="Categorywise Bar Chart of Gender", x="Gender", y="Count", fill="Readmission Status")
ggplot(data=data, aes(x=factor(A1Cresult, labels = c("No Measurement", "Normal Reading", "Abnormal")),
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_bar(position=position_dodge(), color = "black") +
labs(title="Categorywise Bar Chart of A1C Test Result", x="A1C", y="Count", fill="Readmission Status")
ggplot(data=data, aes(x=factor(insulin, labels = c("No Insulin", "Decrease in Insulin", "Steady Insulin", "Increase in Insulin")),
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_bar(position=position_dodge(), color = "black") +
labs(title="Categorywise Bar Chart of Insulin Change", x="Insulin Change", y="Count", fill="Readmission Status")
ggplot(data=data, aes(x=factor(change, labels = c("No Change", "Change in Medication")),
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_bar(position=position_dodge(), color = "black") +
labs(title="Categorywise Bar Chart of Medication Change", x="Medication Change", y="Count", fill="Readmission Status")
ggplot(data=data, aes(x=time_in_hospital,
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_density(alpha=0.5) +
labs(x="Time in Hospital", y="Density",
title="Distribution of Time in Hospital Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")),
y=time_in_hospital)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="Time in Hospital", x="Readmission Status",
title="Distribution of Time in Hospital Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=num_lab_procedures,
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_density(alpha=0.5) +
labs(x="# of Lab Procedures", y="Density",
title="Distribution of # of Lab Procedures Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")),
y=num_lab_procedures)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="# of Lab Procedures", x="Readmission Status",
title="Distribution of # of Lab Procedures Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=num_procedures,
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_density(alpha=0.5) +
labs(x="# of Non-Lab Procedures", y="Density",
title="Distribution of # of Non-Lab Procedures Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")),
y=num_procedures)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="# of Non-Lab Procedures", x="Readmission Status",
title="Distribution of # of Non-Lab Procedures Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=num_medications,
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_density(alpha=0.5) +
labs(x="# of Medications Administered", y="Density",
title="Distribution of # of Medications Administered Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")),
y=num_medications)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="# of Medications Administered", x="Readmission Status",
title="Distribution of # Medications Administered Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=number_diagnoses,
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_density(alpha=0.5) +
labs(x="# of Diagnoses", y="Density",
title="Distribution of # of Diagnoses Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")),
y=number_diagnoses)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="# of Diagnoses", x="Readmission Status",
title="Distribution of # of Diagnoses Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=age,
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_density(alpha=0.5) +
labs(x="End of Age Interval", y="Density",
title="Distribution of End Age of Interval Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")),
y=age)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="End of Age Interval", x="Readmission Status",
title="Distribution of End Age of Interval Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=healthcare_utilization,
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_density(alpha=0.5) +
labs(x="Healthcare Utilization", y="Density",
title="Distribution of Healthcare Utilization by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")),
y=healthcare_utilization)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="Healthcare Utilization", x="Readmission Status",
title="Distribution of Healthcare Utilization Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=num_other_diabetes_meds_up_stdy_dwn,
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_density(alpha=0.5) +
labs(x="# of Diabetes Medications Changed", y="Density",
title="Distribution of # of Diabetes Medications Changed Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")),
y=num_other_diabetes_meds_up_stdy_dwn)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="# of Diabetes Medications Changed", x="Readmission Status",
title="Distribution of # of Diabetes Medications Changed Grouped by Readmission Status", fill="Readmission Status")
We saw earlier that we have a highly imbalanced class problem. Approximately 90% of our observations did not have a hospital readmission within 30 days. This is about what we would expect given the problem domain: most people will not be readmitted to the hospital, but a small subset will. Because of this, it is important to properly adjust the metrics and methods used to achieve our goal of predicting hospital readmission.
Typically in a classification problem, one cares most about accuracy. However, given that 90% of our points belong to one class, we could achieve 90% accuracy by simply predicting the majority class every time. This is not useful, because we are most interested in identifying instances of the minority class: those who will have a hospital readmission. In this case, it is more important to optimize for metrics like recall, precision, F1 score, and AUC. We will use these metrics, chiefly AUC, to compare our models.
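To make the accuracy problem concrete, here is a small illustration on hypothetical toy labels (not our model's output) of how a majority-class classifier scores 90% accuracy while catching zero readmissions, along with the precision, recall, and F1 computations we rely on instead:

```r
# Toy data: 90 non-readmitted (0) and 10 readmitted (1) encounters
actual    <- c(rep(0, 90), rep(1, 10))
predicted <- rep(0, 100)                # always predict the majority class
accuracy  <- mean(predicted == actual)  # 0.9, despite learning nothing

# Precision, recall, and F1 for the positive (readmitted) class
tp <- sum(predicted == 1 & actual == 1)
fp <- sum(predicted == 1 & actual == 0)
fn <- sum(predicted == 0 & actual == 1)
precision <- ifelse(tp + fp == 0, 0, tp / (tp + fp))
recall    <- tp / (tp + fn)             # 0: no readmission is ever caught
f1        <- ifelse(precision + recall == 0, 0,
                    2 * precision * recall / (precision + recall))
```

The 90% accuracy here is pure class imbalance; recall and F1 both collapse to zero, which is why we lean on those metrics and on AUC below.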
One way to improve the training process in an imbalanced class problem is to rebalance the classes, either by oversampling instances of the minority class or by undersampling instances of the majority class. This should, in theory, lead to classifiers that are not biased toward one class or the other. One popular technique for doing this is the [Synthetic Minority Over-sampling Technique (SMOTE)](https://arxiv.org/pdf/1106.1813.pdf).
SMOTE creates new instances of the minority class by forming convex combinations of neighboring instances. It effectively draws lines between minority points in the feature space and samples along these lines. This lets us balance our dataset with less overfitting than plain duplication, since we create new synthetic examples rather than reusing existing ones. Some potential for overfitting remains, however, given that we are still creating data from existing data points.
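The convex-combination idea can be sketched in a few lines of base R on hypothetical toy data (this is only the core geometric step, not the full SMOTE implementation we call below):

```r
# Minimal sketch of the SMOTE idea: a synthetic minority point is a convex
# combination of a minority observation and one of its minority-class neighbors.
set.seed(1)
minority <- matrix(rnorm(10 * 2), ncol = 2)  # 10 toy minority observations, 2 features

x <- minority[1, ]                           # the point to oversample around
# Euclidean distances from x to the other minority points
others <- minority[-1, , drop = FALSE]
d <- sqrt(rowSums((others - matrix(x, nrow(others), 2, byrow = TRUE))^2))
neighbor <- others[which.min(d), ]           # nearest minority neighbor

lambda <- runif(1)                           # random position along the segment
synthetic <- x + lambda * (neighbor - x)     # lies between x and its neighbor
```

Because `lambda` is in [0, 1], each synthetic point sits on the line segment between a minority point and one of its neighbors, which is what keeps the new examples plausible.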
if(file.exists("dataset/trainSMOTE.csv")) {
trainData_SMOTE <- read.csv("dataset/trainSMOTE.csv")
} else {
trainData$readmitted <- as.factor(trainData$readmitted)
trainData_SMOTE <- SMOTE(readmitted ~ ., trainData)
write.csv(trainData_SMOTE, "dataset/trainSMOTE.csv", row.names = F)
}
Our problem is a classic example of a two-way classification supervised learning problem. We will use three different machine learning models to try to predict hospital readmission:
1. XGBoost
2. KNN
3. GLMnet
XGBoost stands for eXtreme Gradient Boosting. As the name indicates, it is a boosting method that combines the output of many Classification And Regression Trees (CART). CARTs are like decision trees except that, in addition to producing a classification, they attach a score to each individual leaf. When we combine several trees, the scores attached to a particular observation are added together, and XGBoost's classification and regression outputs are based on this sum. Training consists of repeatedly selecting a new tree to add to the ensemble based on which one best optimizes the loss function; hence the name gradient boosting, since each added tree takes a step in the best direction for optimizing the objective function.
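The additive mechanism can be illustrated with two hand-made toy "trees" (these splits and scores are invented for illustration, not trees XGBoost actually learns): each tree maps an observation to a leaf score, the scores are summed, and for the `binary:logistic` objective the sum is passed through the logistic function to get a probability.

```r
# Two toy "trees", each returning a leaf score for an observation
tree_1 <- function(obs) if (obs["time_in_hospital"] > 7) 0.8 else -0.3
tree_2 <- function(obs) if (obs["healthcare_utilization"] > 2) 0.5 else -0.1

obs <- c(time_in_hospital = 10, healthcare_utilization = 3)
raw_score <- tree_1(obs) + tree_2(obs)  # leaf scores add across trees: 0.8 + 0.5 = 1.3
prob <- 1 / (1 + exp(-raw_score))       # logistic link turns the sum into a probability
```

Adding a new tree shifts `raw_score` for each observation, and XGBoost chooses the tree whose shifts most improve the loss.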
# create XGB Boost dense matrix objects --> faster performance
dtrain <- xgb.DMatrix(data = as.matrix(trainData_SMOTE[, 1:41]), label = as.numeric(trainData_SMOTE$readmitted))
dtest <- xgb.DMatrix(data = as.matrix(testData[, 1:41]), label = as.numeric(testData$readmitted))
In machine learning, hyperparameter optimization is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training.
if (file.exists("dataset/xgb_params.csv")) {
  best_params <- read.csv("dataset/xgb_params.csv")
} else {
  param_grid <- expand.grid(objective = "binary:logistic",
                            eval_metric = "auc",
                            eta = c(0.1, 0.3, 0.5),
                            max_depth = c(2, 4, 6),
                            gamma = c(2, 3, 4),
                            scale_pos_weight = c(2, 3))
  best_test_auc <- 0
  best_params <- param_grid[1, ]
  for (i in 1:nrow(param_grid)) {
    print(i)
    xgb_cv <- xgb.cv(params = param_grid[i, ], data = dtrain, nrounds = 150,
                     nfold = 10, early_stopping_rounds = 5, verbose = F)
    if (xgb_cv$evaluation_log[nrow(xgb_cv$evaluation_log), test_auc_mean] > best_test_auc) {
      best_test_auc <- xgb_cv$evaluation_log[nrow(xgb_cv$evaluation_log), test_auc_mean]
      best_params <- param_grid[i, ]
    }
  }
  write.csv(best_params, "dataset/xgb_params.csv", row.names = F)
}
xgb <- xgb.train(params = best_params, data = dtrain, nrounds = 500)
xgbPredict <- predict(xgb, dtest)
xgbPredict <- as.numeric(xgbPredict> 0.5)
confusionMatrix(reference = as.factor(testData$readmitted), data = as.factor(xgbPredict), mode = "everything", positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9761 753
## 1 3033 537
##
## Accuracy : 0.7312
## 95% CI : (0.7238, 0.7385)
## No Information Rate : 0.9084
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0999
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.41628
## Specificity : 0.76294
## Pos Pred Value : 0.15042
## Neg Pred Value : 0.92838
## Precision : 0.15042
## Recall : 0.41628
## F1 : 0.22099
## Prevalence : 0.09159
## Detection Rate : 0.03813
## Detection Prevalence : 0.25348
## Balanced Accuracy : 0.58961
##
## 'Positive' Class : 1
##
We see that this first model does not perform very well: an accuracy of 73.1%, precision of 15.0%, recall of 41.6%, and an F1 score of 0.22. Note that the No Information Rate is 90.8%, so on this imbalanced test set the accuracy is actually worse than always predicting no readmission.
importance <- xgb.importance(model = xgb)
xgb.ggplot.importance(importance_matrix = importance, top_n = 10) +
theme(axis.text = element_text(size = 12),
plot.title = element_text(size=14),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(size=0.25),
axis.title.y = element_blank())
According to the XGBoost algorithm, healthcare utilization, age, and time in hospital are the three most important factors for predicting hospital readmission.
pred <- prediction(xgbPredict, testData$readmitted)
AUC <- performance(pred, "auc")@y.values[[1]]
perf_ROC <- performance(pred, "tpr", "fpr")  # compute the ROC curve
plot(perf_ROC, main = "ROC plot")
text(0.9, 0.0, paste("AUC = ", format(AUC, digits = 3, scientific = FALSE)))
We see that XGBoost produces a testing AUC-ROC of 0.59.
KNN (k-nearest neighbors) is an algorithm that takes in numerical features, calculates the distance between a given observation and every other observation in the dataset, and then predicts the label held by the majority of that observation's "K nearest neighbors" (the K points closest to the point of interest). For this reason it is important to choose \(K\) so as to avoid tie votes among the neighbors; with two classes, an odd \(K\) guarantees a majority.
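As a minimal illustration of the idea (toy data, not our claims dataset), KNN can be sketched in base R: compute the Euclidean distance from the query point to every training point, then take a majority vote among the k closest labels.

```r
# Minimal KNN sketch on toy data: predict the label of a query point
# from the majority vote of its k nearest neighbors.
knn_predict <- function(X, y, query, k = 3) {
  # Euclidean distance from the query to every row of X
  d <- sqrt(rowSums((X - matrix(query, nrow(X), ncol(X), byrow = TRUE))^2))
  neighbors <- y[order(d)[1:k]]       # labels of the k closest points
  names(which.max(table(neighbors)))  # majority vote
}

X <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
y <- c("stay", "stay", "stay", "readmit", "readmit", "readmit")
knn_predict(X, y, query = c(0.5, 0.5), k = 3)  # -> "stay"
```

Using an odd k here means a tie between the two labels is impossible, matching the point about choosing \(K\) above.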
ctrl <- trainControl(method="cv", number = 10)
cl <- makePSOCKcluster(detectCores())
registerDoParallel(cl)
# Note: caret expects metric = "Kappa" (capitalized); with "kappa" it falls
# back to Accuracy, as the warning below shows.
knn <- train(as.factor(readmitted) ~ ., data = trainData_SMOTE, method = "knn",
             metric = "kappa", trControl = ctrl,
             preProcess = c("center", "scale"), tuneLength = 5)
## Warning in train.default(x, y, weights = w, ...): The metric "kappa" was
## not in the result set. Accuracy will be used instead.
stopCluster(cl)
plot(knn)
We see that, even testing KNN on just 5 values of k, performance decreases as k increases. Therefore, we choose not to test any larger values.
knnPredict <- predict(knn, testData)
confusionMatrix(reference = as.factor(testData$readmitted), data = as.factor(knnPredict), mode = "everything", positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 8111 704
## 1 4683 586
##
## Accuracy : 0.6175
## 95% CI : (0.6094, 0.6255)
## No Information Rate : 0.9084
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.037
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.45426
## Specificity : 0.63397
## Pos Pred Value : 0.11122
## Neg Pred Value : 0.92014
## Precision : 0.11122
## Recall : 0.45426
## F1 : 0.17869
## Prevalence : 0.09159
## Detection Rate : 0.04161
## Detection Prevalence : 0.37411
## Balanced Accuracy : 0.54412
##
## 'Positive' Class : 1
##
We see that this model still does not perform very well: an accuracy of 61.75%, precision of 11.1%, recall of 45.4%, and an F1 score of 0.18.
knnPred <- prediction(as.numeric(knnPredict), testData$readmitted)
knn_AUC <- performance(knnPred, "auc")@y.values[[1]]
knn_perf_ROC <- performance(knnPred, "tpr", "fpr")  # compute the ROC curve
plot(knn_perf_ROC, main = "ROC plot")
text(0.9, 0.0, paste("AUC = ", format(knn_AUC, digits = 3, scientific = FALSE)))
We see that KNN produces a testing AUC-ROC of 0.544.
GLMnet fits a generalized linear model via penalized maximum likelihood. The algorithm implemented in the package computes the regularization path for the elastic-net penalty over a grid of values for the regularization parameter \(\lambda\) (https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#log). The elastic-net penalty is controlled by \(\alpha\), which bridges the gap between the lasso (\(\alpha = 1\), the default) and ridge (\(\alpha = 0\)) penalties, while the tuning parameter \(\lambda\) controls the overall strength of the penalty. To reduce the noise caused by data splitting, we tuned the parameters with 20-fold cross-validation using the caret package, as described here (http://www.milanor.net/blog/cross-validation-for-predictive-analytics-using-r/).
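For reference, the penalty glmnet adds to the negative log-likelihood can be written out directly. The sketch below (with hypothetical coefficient values, not our fitted model) shows how \(\alpha\) interpolates between the ridge and lasso penalties:

```r
# Elastic-net penalty: lambda * ((1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1)
# alpha = 0 gives the ridge penalty, alpha = 1 the lasso penalty.
enet_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(0.5, -0.25, 0)                      # hypothetical coefficients
enet_penalty(beta, lambda = 0.01, alpha = 0)  # ridge: 0.0015625
enet_penalty(beta, lambda = 0.01, alpha = 1)  # lasso: 0.0075
```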
trainData_SMOTE$readmitted <- as.factor(trainData_SMOTE$readmitted)
glmnet_grid <- expand.grid(alpha = c(0, .1, .2, .4, .6, .8, 1),
                           lambda = seq(.01, .2, length = 10))
glmnet_ctrl <- trainControl(method = "cv", number = 20)
cl <- makePSOCKcluster(detectCores())
registerDoParallel(cl)
glmnet_fit <- train(readmitted ~ ., data = trainData_SMOTE,
                    method = "glmnet",
                    family = "binomial",
                    preProcess = c("center", "scale"),
                    tuneGrid = glmnet_grid,
                    trControl = glmnet_ctrl)
stopCluster(cl)
# Best tuning parameters
glmnet_fit$bestTune
## alpha lambda
## 1 0 0.01
The best tuning parameters chosen after 20-fold cross-validation are \(\alpha = 0, \lambda = 0.01\). This means that we are effectively fitting a ridge regression (a pure L2 penalty, with no coefficients shrunk exactly to zero).
trellis.par.set(caretTheme())
plot(glmnet_fit, scales = list(x = list(log = 2)))
pred_classes <- predict(glmnet_fit, newdata = testData, type = "raw")
confusionMatrix(data = as.factor(pred_classes), reference = as.factor(testData$readmitted), mode = "everything", positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 10247 825
## 1 2547 465
##
## Accuracy : 0.7606
## 95% CI : (0.7534, 0.7676)
## No Information Rate : 0.9084
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1009
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.36047
## Specificity : 0.80092
## Pos Pred Value : 0.15438
## Neg Pred Value : 0.92549
## Precision : 0.15438
## Recall : 0.36047
## F1 : 0.21618
## Prevalence : 0.09159
## Detection Rate : 0.03302
## Detection Prevalence : 0.21386
## Balanced Accuracy : 0.58069
##
## 'Positive' Class : 1
##
Even with our last model, we are still not performing very well: an accuracy of 76.06%, precision of 15.4%, recall of 36.0%, and an F1 score of 0.216.
plot(varImp(glmnet_fit,scale=F), top = 10)
coef(glmnet_fit$finalModel, s = glmnet_fit$bestTune$lambda)
## 42 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.2971526132
## race -0.0764752202
## gender 0.0139492093
## age 0.0757599572
## time_in_hospital 0.0392643694
## num_lab_procedures 0.0736108053
## num_procedures -0.0572806512
## num_medications 0.0472362874
## number_emergency 0.0426201602
## number_inpatient 0.2429572892
## number_diagnoses 0.0726062034
## max_glu_serum -0.0104628308
## A1Cresult -0.0816749941
## insulin 0.0419668317
## change 0.0201164513
## healthcare_utilization 0.0044873986
## num_other_diabetes_meds_up_stdy_dwn -0.0252987930
## at_elective -0.0439580950
## dd_facility_transfer 0.2723021578
## dd_other 0.0582020860
## as_facility_transfer -0.0557204345
## as_ed -0.0395831948
## as_other -0.0443884520
## diag_1grp_Blood -0.0181245675
## diag_1grp_Circulatory 0.0353631182
## diag_1grp_Diabetes 0.0583229720
## diag_1grp_Digestive 0.0058505936
## diag_1grp_Endocrine_Nutrition_Metabolic -0.0059598295
## diag_1grp_External 0.0248698499
## diag_1grp_Genitourinary -0.0090824450
## diag_1grp_Infectious -0.0227158656
## diag_1grp_Injury -0.0021541673
## diag_1grp_Mental 0.0196700772
## diag_1grp_Musculoskeletal -0.0244841718
## diag_1grp_Neoplasm 0.0003117428
## diag_1grp_Nervous 0.0003644846
## diag_1grp_Other -0.0215781812
## diag_1grp_Pregnancy -0.0146363069
## diag_1grp_Respiratory -0.0593963844
## diag_1grp_Sense -0.0494156823
## diag_1grp_Skin -0.0137203141
## diag_1grp_Unknown -0.0177019384
The most important features for predicting hospital readmission according to GLMnet are whether or not a patient was transferred to another facility at discharge, the patient's number of inpatient visits in the past year, and their A1C test result.
glmPred <- prediction(as.numeric(pred_classes), testData$readmitted)
glm_AUC <- performance(glmPred, "auc")@y.values[[1]]
glm_perf_ROC <- performance(glmPred, "tpr", "fpr")  # compute the ROC curve
plot(glm_perf_ROC, main = "ROC plot")
text(0.9, 0.0, paste("AUC = ", format(glm_AUC, digits = 3, scientific = FALSE)))
We see that GLMnet produces a testing AUC-ROC of 0.581.
| ML.Model | AUC | Recall | Precision | F1.Score |
|---|---|---|---|---|
| XGBoost | 0.590 | 0.416 | 0.150 | 0.221 |
| KNN | 0.544 | 0.454 | 0.111 | 0.179 |
| GLMnet | 0.581 | 0.360 | 0.154 | 0.216 |
As we can see, none of our models were very successful at predicting hospital readmission. Based on the above metrics, XGBoost performs the best, but it is still not very good: it correctly identifies only 41.6% of the patients who will be readmitted, and only 15% of the patients it predicts will be readmitted actually are. In the end, it seems that the current features from medical claims data are not sufficient for predicting hospital readmission. This is consistent with our exploratory data analysis, where there appeared to be no real separation between classes across the different predictor variables.